Search for: All records
Total Resources: 3
Author / Contributor:
- Fragkiadaki, Katerina (3)
- Zhang, Yunchu (3)
- Atkeson, Christopher (1)
- Atkeson, Christopher G (1)
- Gan, Chuang (1)
- Gkanatsios, Nikolaos (1)
- Held, David (1)
- Huang, Zhiao (1)
- Jain, Ayush (1)
- Li, Yunzhu (1)
- Lin, Xingyu (1)
- Pathak, Gaurav (1)
- Pokle, Ashwini (1)
- Qi, Carl (1)
- Tung, Hsiao-Yu (1)
- Xian, Zhou (1)
- Yang, Jingyun (1)
Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then relocate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets, are publicly available on our website: https://ebmplanner.github.io.
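The planning step described in this abstract can be illustrated with a short, self-contained sketch. This is not the authors' implementation: the object names, margins, energy terms, and the finite-difference optimizer below are illustrative assumptions. The idea shown is only that each spatial predicate in a parsed instruction contributes one energy term over object positions, and gradient descent on the summed energy yields a goal arrangement.

```python
# Minimal sketch (not the authors' code): compose per-predicate energy
# functions over 2D object positions and minimize their sum by gradient
# descent to obtain a goal arrangement. Names and margins are illustrative.
import numpy as np

def energy_left_of(pos, a, b, margin=0.1):
    # Penalize object `a` not being at least `margin` to the left of `b`.
    return max(0.0, pos[a][0] - pos[b][0] + margin) ** 2

def energy_near(pos, a, b, dist=0.2):
    # Penalize deviation of the a-b distance from a target distance.
    return (np.linalg.norm(pos[a] - pos[b]) - dist) ** 2

def total_energy(flat, names, predicates):
    # Unpack the flat optimization variable into per-object (x, y) positions
    # and sum one energy term per language predicate.
    pos = {n: flat[2 * i:2 * i + 2] for i, n in enumerate(names)}
    return sum(term(pos) for term in predicates)

def numerical_grad(f, x, eps=1e-4):
    # Central finite differences; a stand-in for autodiff to keep the sketch small.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# Hypothetical parse of "put the mug left of the plate and near the bowl":
# two predicates, hence two energy terms.
names = ["mug", "plate", "bowl"]
predicates = [lambda p: energy_left_of(p, "mug", "plate"),
              lambda p: energy_near(p, "mug", "bowl")]
x = np.random.uniform(-0.5, 0.5, size=2 * len(names))  # initial (x, y) per object
f = lambda v: total_energy(v, names, predicates)
for _ in range(500):                                     # plain gradient descent
    x -= 0.1 * numerical_grad(f, x)
goal = {n: x[2 * i:2 * i + 2] for i, n in enumerate(names)}
print({k: v.round(3) for k, v in goal.items()})
```

In this toy setup the resulting goal positions would then be handed to a separate low-level policy that actually moves the objects, mirroring the division of labor the abstract describes.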
Lin, Xingyu; Qi, Carl; Zhang, Yunchu; Huang, Zhiao; Fragkiadaki, Katerina; Li, Yunzhu; Gan, Chuang; Held, David (Conference on Robot Learning (CoRL))
Yang, Jingyun; Tung, Hsiao-Yu; Zhang, Yunchu; Pathak, Gaurav; Pokle, Ashwini; Atkeson, Christopher G; Fragkiadaki, Katerina (5th Annual Conference on Robot Learning)

We propose a visually-grounded library of behaviors approach for learning to manipulate diverse objects across varying initial and goal configurations and camera placements. Our key innovation is to disentangle the standard image-to-action mapping into two separate modules that use different types of perceptual input: (1) a behavior selector which conditions on intrinsic and semantically-rich object appearance features to select the behaviors that can successfully perform the desired tasks on the object in hand, and (2) a library of behaviors, each of which conditions on extrinsic and abstract object properties, such as object location and pose, to predict actions to execute over time. The selector uses a semantically-rich 3D object feature representation extracted from images in a differentiable end-to-end manner. This representation is trained to be view-invariant and affordance-aware using self-supervision, by predicting varying views and successful object manipulations. We test our framework on pushing and grasping diverse objects in simulation as well as transporting rigid, granular, and liquid food ingredients in a real robot setup. Our model outperforms image-to-action mappings that do not factorize static and dynamic object properties. We further ablate the contribution of the selector's input and show the benefits of the proposed view-predictive, affordance-aware 3D visual object representations.
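A minimal sketch of the two-module factorization this abstract describes follows. It is illustrative only: the behavior names, the hand-coded policies, the prototype vectors, and the nearest-prototype selector are stand-ins for the learned components. The point it shows is the interface split: appearance features pick a behavior once, while the chosen behavior maps extrinsic state (object location and pose) to actions.

```python
# Minimal sketch (illustrative names, not the authors' code) of a behavior
# selector keyed on object *appearance* features and a library of behaviors
# keyed on *extrinsic* state such as object location.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

Action = np.ndarray  # e.g. an end-effector displacement; 3-DoF assumed here

@dataclass
class Behavior:
    name: str
    policy: Callable[[np.ndarray], Action]  # extrinsic state -> action
    prototype: np.ndarray                   # appearance embedding it handles

def push_policy(state: np.ndarray) -> Action:
    # Move toward the object's (x, y) location, keep height fixed.
    return np.array([state[0], state[1], 0.0]) * 0.1

def grasp_policy(state: np.ndarray) -> Action:
    # Descend onto the object's location.
    return np.array([state[0], state[1], -0.05]) * 0.1

library: List[Behavior] = [
    Behavior("push", push_policy, prototype=np.array([1.0, 0.0])),
    Behavior("grasp", grasp_policy, prototype=np.array([0.0, 1.0])),
]

def select_behavior(appearance_feat: np.ndarray, library: List[Behavior]) -> Behavior:
    # Stand-in selector: nearest prototype in appearance-feature space.
    # The paper instead learns selection from view-invariant 3D object features.
    scores = [float(appearance_feat @ b.prototype) for b in library]
    return library[int(np.argmax(scores))]

# Rollout: appearance picks the behavior once; extrinsic state drives actions.
appearance = np.array([0.2, 0.9])           # e.g. a graspable rigid object
behavior = select_behavior(appearance, library)
object_pose = np.array([0.4, -0.1, 0.03])   # location from the perception stack
print(behavior.name, behavior.policy(object_pose))
```

The design point the sketch mirrors is that only the selector sees appearance, so swapping in a new object class means extending the selector, not retraining every behavior.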